Groupby operations

Some imports:


In [ ]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
try:
    import seaborn
except ImportError:
    pass

pd.options.display.max_rows = 10

Some 'theory': the groupby operation (split-apply-combine)

By "group by" we are referring to a process involving one or more of the following steps

  • Splitting the data into groups based on some criteria
  • Applying a function to each group independently
  • Combining the results into a data structure

Similar to SQL GROUP BY

The example of the image in pandas syntax:


In [ ]:
df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df

In [ ]:
df.groupby('key').aggregate('sum')  # np.sum

In [ ]:
df.groupby('key').sum()

And now applying this on some real data

These exercises are based on the PyCon tutorial of Brandon Rhodes (so all credit to him!) and the datasets he prepared for that. You can download these data from here: titles.csv and cast.csv and put them in the /data folder.


In [ ]:
cast = pd.read_csv('data/cast.csv')
cast.head()

In [ ]:
titles = pd.read_csv('data/titles.csv')
titles.head()
EXERCISE: Using groupby(), plot the number of films that have been released each decade in the history of cinema.

In [ ]:

EXERCISE: Use groupby() to plot the number of "Hamlet" films made each decade.

In [ ]:

EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in each year of the 1950s?

In [ ]:

EXERCISE: Use groupby() to determine how many roles are listed for each of The Pink Panther movies.

In [ ]:

EXERCISE: List, in order by year, each of the films in which Frank Oz has played more than 1 role.

In [ ]:

EXERCISE: List each of the characters that Frank Oz has portrayed at least twice.

In [ ]:

Transforms

Sometimes you don't want to aggregate the groups, but transform the values in each group. This can be achieved with transform


In [ ]:
df

In [ ]:
def normalize(group):
    return (group - group.mean()) / group.std()

In [ ]:
df.groupby('key').transform(normalize)
EXERCISE: Calculate the ratio of number roles of actors and actresses to the total number of roles per decade and plot this for both in time (tip: you need to do a groupby twice in two steps, once calculating the numbers, and then the ratios.

In [ ]:


In [ ]:


In [ ]:

Value counts

A useful shortcut to calculate the number of occurences of certain values is value_counts (this is somewhat equivalent to df.groupby(key).size()))

For example, what are the most occuring movie titles?


In [ ]:
titles.title.value_counts().head()
EXERCISE: Which years saw the most films released?

In [ ]:

EXERCISE: Plot the number of released films over time

In [ ]:

EXERCISE: Plot the number of "Hamlet" films made each decade.

In [ ]:

EXERCISE: What are the 11 most common character names in movie history?

In [ ]:

EXERCISE: Which actors or actresses appeared in the most movies in the year 2010?

In [ ]:

EXERCISE: Plot how many roles Brad Pitt has played in each year of his career.

In [ ]:

EXERCISE: What are the 10 most film titles roles that start with the word "The Life"?

In [ ]:

EXERCISE: How many leading (n=1) roles were available to actors, and how many to actresses, in the 1950s? And in 2000s?

In [ ]:


In [ ]:


In [ ]: